feat(AGX1-274): record task creator identity and FGAC migration safety#246

Open
asherfink wants to merge 2 commits into main from asher.fink/agx1-274-task-dual-write

@asherfink asherfink commented May 21, 2026

Related work

Parent epic: AGX1-264 — per-task FGAC. Follow-ups bundled in AGX1-291.

This change is part of a 5-PR stack across 3 repos. Merge order: scaleapi/scaleapi#144783 (release sgp-authz 0.7.1) → scaleapi/agentex#353 → scaleapi/agentex#356 → this PR → #249.

| Repo | PR | Purpose |
| --- | --- | --- |
| scaleapi/scaleapi | scaleapi/scaleapi#144783 | sgp-authz 0.7.1: Action.CANCEL |
| scaleapi/agentex | scaleapi/agentex#353 | agentex-auth per-account routing + cancel op |
| scaleapi/agentex | scaleapi/agentex#356 | agentex-auth register_resource API + cancel cleanup |
| scaleapi/scale-agentex | #246 (this PR) | task creator audit columns + FGAC dual-write + flag |
| scaleapi/scale-agentex | #249 | per-RPC operation rewire + 404/403 wrap |

Two commits: keep them separate during review, since the audit-column schema change is independent of the dual-write call sites.

Summary

Commit 1 — passive audit columns:

  • Adds creator_user_id / creator_service_account_id columns to the tasks table, populated from the request principal on AgentTaskService.create_task. Best-effort (NULLable; see caveat below).
  • Adds a CHECK ((creator_user_id IS NULL) OR (creator_service_account_id IS NULL)) to enforce at-most-one creator type at the DB layer (constraint name: ck_tasks_at_most_one_creator).
  • Adds partial indexes ix_tasks_creator_user_id and ix_tasks_creator_service_account_id (CREATE INDEX CONCURRENTLY) for future "tasks created by X" lookups.
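
To make the at-most-one-creator rule concrete, here is a minimal sketch of the same CHECK predicate. It uses an in-memory SQLite database as a stand-in for Postgres (an assumption made only so the example runs without a server); the table is reduced to three columns and `try_insert` is a hypothetical helper, not code from this PR.

```python
import sqlite3

# Reduced stand-in for the tasks table; only the creator columns matter here.
DDL = """
CREATE TABLE tasks (
    id TEXT PRIMARY KEY,
    creator_user_id TEXT,
    creator_service_account_id TEXT,
    CONSTRAINT ck_tasks_at_most_one_creator
        CHECK ((creator_user_id IS NULL) OR (creator_service_account_id IS NULL))
)
"""


def try_insert(conn, task_id, user_id, service_account_id):
    """Return True if the row is accepted, False if the CHECK rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO tasks VALUES (?, ?, ?)",
                (task_id, user_id, service_account_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
results = {
    "user_only": try_insert(conn, "t1", "user-1", None),
    "service_account_only": try_insert(conn, "t2", None, "sa-1"),
    "both_null": try_insert(conn, "t3", None, None),
    "both_set": try_insert(conn, "t4", "user-1", "sa-1"),
}
```

Only the `both_set` row is rejected; the both-NULL row is accepted, which is what lets tasks created outside a request context be stored without a creator.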

Commit 2 — FGAC dual-write call sites + flag:

  • Adds an FGAC_TASKS_DUAL_WRITE env-var flag, injected into AgentTaskService via FastAPI DI. Gates the dual-write behavior end-to-end.
  • create_task calls register_resource(task, parent_resource=agent) on the authorization service after the Postgres row is persisted, so the task is registered with tenant + owner + parent_agent tuples atomically (via scaleapi/agentex#356's new endpoint).
  • delete_task calls deregister_resource(task) after the Postgres delete. Pre-resolves the task id by name first so the post-delete deregister doesn't race the lookup.
  • Both call sites share a _dual_write_with_retry(op_name, do_call, task_id) helper. Retries AuthenticationServiceUnavailableError / AuthenticationGatewayError with exponential backoff + jitter (3 retries → 4 total attempts max), mirroring AgentsACPUseCase.grant_with_retry. Non-transient exceptions are not retried.
  • Emits Datadog metrics (task_fgac_dual_write.attempt|success|retry|failure) tagged with op:register|deregister and exception_class:<name> on failure — these are the rollout signal for AGX1-291's operator runbook.
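
The retry behavior described above can be sketched roughly as follows. This is not the PR's implementation: `TransientAuthError` is a hypothetical stand-in for the two transient exception types, the `sleep` and `emit` parameters are injected only so the sketch can be exercised without real delays or a metrics client, and the delay values are assumptions.

```python
import asyncio
import logging
import random
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


class TransientAuthError(Exception):
    """Hypothetical stand-in for the transient errors named in the PR."""


async def dual_write_with_retry(
    op_name: str,
    do_call: Callable[[], Awaitable[T]],
    task_id: str,
    *,
    max_retries: int = 3,          # 3 retries -> 4 attempts in total
    base_delay: float = 0.1,       # assumed value, not taken from the PR
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
    emit: Callable[[str, List[str]], None] = lambda name, tags: None,
) -> T:
    tags = [f"op:{op_name}"]
    for attempt in range(max_retries + 1):
        emit("task_fgac_dual_write.attempt", tags)
        try:
            result = await do_call()
        except TransientAuthError as exc:
            if attempt == max_retries:
                # Retries exhausted: record the failure and let it propagate.
                emit(
                    "task_fgac_dual_write.failure",
                    tags + [f"exception_class:{type(exc).__name__}"],
                )
                raise
            emit("task_fgac_dual_write.retry", tags)
            logger.warning("retrying %s for task %s", op_name, task_id)
            delay = base_delay * (2 ** attempt)             # exponential backoff
            await sleep(delay + random.uniform(0, delay))   # plus jitter
        else:
            emit("task_fgac_dual_write.success", tags)
            return result
    raise AssertionError("unreachable")
```

Any exception other than `TransientAuthError` is not caught, so it propagates on the first attempt without a retry, matching the "non-transient exceptions are not retried" rule.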

Migration safety

  • ALTER TABLE ... ADD CONSTRAINT ... NOT VALID followed by ALTER TABLE ... VALIDATE CONSTRAINT: splits the operation so the brief ACCESS EXCLUSIVE lock is not held for the full-table validation scan. tasks is high-write; a CHECK addition without NOT VALID would queue behind in-flight transactions and then block readers and writers for the duration of the scan.
  • Indexes created CONCURRENTLY in an autocommit_block.
  • Migration revision: a1f73ada66c5 (add_task_creator_columns). down_revision is 6c942325c828 (adding_task_cleaned_at, the current alembic head on main); migration_history.txt regenerated via alembic history. The ORM-side CheckConstraint in orm.py matches the DB-side (same constraint name + predicate).

Rollout

  • Flag-off (default): no behavior change. Audit columns populate but no FGAC tuples are written. Safe to merge and deploy.
  • Flag-on: register_resource and deregister_resource fire on create/delete. If they fail after retries, the Postgres row is still the durable record — orphan auth tuples can be cleaned up out of band per the AGX1-291 operator runbook using the creator-audit columns to identify them.
  • Operator rollout assumes a redeploy cycles pods; the flag is read once at DI-resolve time, so mid-process flips are intentionally invisible.
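
A small sketch of why mid-process flips are invisible: the flag is parsed once and handed to the service as a plain boolean. The set of truthy strings and the `TaskServiceSketch` class are assumptions for illustration; the PR does not show how the real value is parsed.

```python
import os

# Assumed truthy spellings; the actual parsing rules are not shown in the PR.
TRUTHY = {"1", "true", "yes", "on"}


def read_dual_write_flag(environ=os.environ) -> bool:
    """Parse FGAC_TASKS_DUAL_WRITE; missing or unrecognized values mean off."""
    raw = environ.get("FGAC_TASKS_DUAL_WRITE", "")
    return raw.strip().lower() in TRUTHY


class TaskServiceSketch:
    """Hypothetical service that captures the flag at construction time."""

    def __init__(self, dual_write_enabled: bool) -> None:
        self.dual_write_enabled = dual_write_enabled


env = {"FGAC_TASKS_DUAL_WRITE": "true"}
service = TaskServiceSketch(read_dual_write_flag(env))
env["FGAC_TASKS_DUAL_WRITE"] = "false"  # a flip after construction
```

After the flip, `service.dual_write_enabled` still holds the value captured at construction, so a change only takes effect once a new instance is built, for example after a redeploy.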

Audit-trail caveat

Creator attribution is best-effort: tasks created outside an HTTP request context (Temporal activities, background workers, any path that constructs AgentTaskService without request.state.principal_context) leave both columns NULL. The CHECK constraint allows both-NULL, and test_no_resolvable_creator_leaves_both_columns_null exercises this path.
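
The best-effort resolution can be sketched as below. The field names `user_id` and `service_account_id` on the principal, and the choice to prefer the user id when both are present, are assumptions for illustration; `principal_field` here only mirrors the idea of the PR's `_principal_field` helper.

```python
from types import SimpleNamespace
from typing import Any, Optional, Tuple


def principal_field(principal: Any, name: str) -> Optional[str]:
    """Read a field from a dict-shaped or attribute-shaped principal."""
    if principal is None:
        return None
    if isinstance(principal, dict):
        return principal.get(name)
    return getattr(principal, name, None)


def resolve_creator(principal: Any) -> Tuple[Optional[str], Optional[str]]:
    """Return (creator_user_id, creator_service_account_id), at most one set."""
    user_id = principal_field(principal, "user_id")
    service_account_id = principal_field(principal, "service_account_id")
    if user_id is not None:
        # Assumption: prefer the user id so the at-most-one CHECK always holds.
        return user_id, None
    return None, service_account_id


from_dict = resolve_creator({"user_id": "user-1"})
from_object = resolve_creator(SimpleNamespace(service_account_id="sa-1"))
no_context = resolve_creator(None)  # e.g. a background worker
```

With no principal context, both values come back as `None`, which corresponds to the both-NULL row the constraint allows.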

What changed

  • database/migrations/alembic/versions/2026_05_21_1508_add_task_creator_columns_a1f73ada66c5.py (new): NOT VALID-pattern migration. down_revision = "6c942325c828".
  • src/adapters/orm.py: declarative CheckConstraint mirroring the DB constraint.
  • src/domain/entities/tasks.py: new optional fields on TaskEntity.
  • src/domain/services/task_service.py:
    • _principal_field helper (handles dict-vs-pydantic principal shape from the authn proxy).
    • create_task reads creator_user_id / creator_service_account_id from principal context.
    • AgentTaskService.__init__ takes dual_write_enabled: DEnvironmentVariable(EnvVarKeys.FGAC_TASKS_DUAL_WRITE).
    • _dual_write_with_retry(op_name, do_call, task_id) keyed by op name; reused from both call sites.
  • src/adapters/authorization/adapter_agentex_authz_proxy.py: forwards to agentex-auth's /v1/authz/register and /v1/authz/deregister.
  • src/config/environment_variables.py: new FGAC_TASKS_DUAL_WRITE key.
  • Tests:
    • test_task_audit_columns.py — testcontainers Postgres integration tests for the audit columns (creator population, mutual-exclusion CHECK, both-NULL allowed).
    • test_task_fgac_dual_write.py — covers register-on-create, deregister-on-delete, flag-off skip, transient retry-and-succeed (both register and deregister sides), retry exhaustion propagating with the Postgres row preserved, and the name-route ItemDoesNotExist swallow.
    • Existing unit/integration tests updated for the new dual_write_enabled constructor parameter.

Test plan

  • migration_lint.py — clean.
  • Ruff + ruff-format + alembic migration-safety lint clean (pre-commit hooks).
  • test_task_audit_columns.py — 7/7 pass locally via testcontainers.
  • test_task_fgac_dual_write.py — collects cleanly; runs in CI integration suite.
  • Manual: deploy to staging with flag off, confirm \d tasks shows new columns + constraint + indexes; flip flag on for one account, confirm task_fgac_dual_write.success fires.

Greptile Summary

This PR adds two features: nullable creator_user_id/creator_service_account_id audit columns to the tasks table (populated from the request principal on task creation), and a flag-gated FGAC dual-write path that registers/deregisters tasks in the agentex-auth authorization graph on create/delete. The dual-write is off by default for safe incremental rollout.

  • Commit 1 (schema): Adds creator_user_id + creator_service_account_id columns, a CHECK constraint enforcing at-most-one creator type, and concurrent indexes — but the NOT VALID + VALIDATE CONSTRAINT pair runs inside the same Alembic transaction, negating the stated migration-safety benefit (the ACCESS EXCLUSIVE lock is held for the full validation scan, not just the constraint-addition step).
  • Commit 2 (dual-write): Wires AgentTaskService to call register_resource after task_repository.create and deregister_resource after task_repository.delete, with an exponential-backoff retry helper (3 retries, mirroring grant_with_retry) and Datadog counters for rollout observability. Test coverage is thorough across both audit-column population and dual-write behavior.

Confidence Score: 3/5

Safe to merge the application-layer changes (dual-write logic, audit-column population, new env flag, tests), but the migration needs correction before running against production — it will block all reads and writes on the tasks table for the full duration of the CHECK constraint validation scan.

The dual-write service code, retry helper, ORM changes, and test suite are well-structured and low-risk. The migration is the blocker: the NOT VALID + VALIDATE constraint pair runs inside a single Alembic transaction, so the ACCESS EXCLUSIVE lock acquired by ADD CONSTRAINT is held through the entire table scan. On a large, high-write tasks table this is the production-impact scenario the migration comment explicitly set out to avoid. The fix (wrapping both constraint statements in their own autocommit_block, or splitting into two revisions) is straightforward but must land before the migration is applied in any environment with meaningful data volume.

agentex/database/migrations/alembic/versions/2026_05_21_1508_add_task_creator_columns_a1f73ada66c5.py — the constraint locking pattern needs to be corrected before this migration runs in production.

Important Files Changed

| Filename | Overview |
| --- | --- |
| agentex/database/migrations/alembic/versions/2026_05_21_1508_add_task_creator_columns_a1f73ada66c5.py | Adds creator audit columns and a CHECK constraint via NOT VALID + VALIDATE pattern, but both constraint statements run in the same Alembic transaction, negating the stated ACCESS EXCLUSIVE lock-duration benefit; also creates full indexes on nullable columns where partial indexes would be more efficient. |
| agentex/src/domain/services/task_service.py | Adds FGAC dual-write call sites for register/deregister with exponential backoff retry; logic is sound and well-tested, though register failure after exhausted retries propagates to callers while the task row already exists. |
| agentex/src/adapters/authorization/adapter_agentex_authz_proxy.py | Adds register_resource and deregister_resource HTTP calls to agentex-auth; consistent with existing grant/revoke patterns. |
| agentex/src/adapters/authorization/port.py | Adds abstract register_resource and deregister_resource methods to the AuthorizationGateway port; clean extension of the interface. |
| agentex/src/adapters/orm.py | Adds creator_user_id and creator_service_account_id columns and a CheckConstraint matching the migration; ORM and DB constraint names are aligned. |
| agentex/src/domain/services/authorization_service.py | Adds register_resource and deregister_resource methods with bypass support and structured logging; mirrors existing grant/revoke patterns cleanly. |
| agentex/tests/integration/use_cases/test_task_fgac_dual_write.py | Comprehensive dual-write integration tests covering register-on-create, deregister-on-delete, flag-off skipping, retry-then-succeed, and exhausted-retry propagation with row persistence. |
| agentex/tests/integration/use_cases/test_task_audit_columns.py | Integration tests validating user/service-account principal column population, both-NULL case, and CHECK constraint violation; tests both dict and namespace principal shapes. |

Sequence Diagram

sequenceDiagram
    participant Client
    participant AgentTaskService
    participant TaskRepository
    participant AuthorizationService
    participant AgentexAuthProxy

    Client->>AgentTaskService: create_task(agent, task_name, ...)
    AgentTaskService->>AuthorizationService: principal_context (read)
    AgentTaskService->>TaskRepository: create(task + creator_user_id/creator_service_account_id)
    TaskRepository-->>AgentTaskService: TaskEntity

    alt dual_write_enabled
        AgentTaskService->>AgentTaskService: _dual_write_with_retry(register)
        loop retry up to 3x on transient error
            AgentTaskService->>AuthorizationService: "register_resource(task, parent=agent)"
            AuthorizationService->>AgentexAuthProxy: POST /v1/authz/register
        end
    end
    AgentTaskService-->>Client: TaskEntity

    Client->>AgentTaskService: delete_task(id/name)
    alt dual_write_enabled AND name-only lookup
        AgentTaskService->>TaskRepository: "get(name=name)"
        TaskRepository-->>AgentTaskService: task_id
    end
    AgentTaskService->>TaskRepository: delete(id/name)
    alt dual_write_enabled AND task_id resolved
        AgentTaskService->>AgentTaskService: _dual_write_with_retry(deregister)
        AgentTaskService->>AuthorizationService: deregister_resource(task)
        AuthorizationService->>AgentexAuthProxy: POST /v1/authz/deregister
    end
    AgentTaskService-->>Client: None
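
The delete half of the diagram can be sketched as follows, to show why the id is resolved before the row is removed. `InMemoryTaskRepository`, `get_id_by_name`, and the `deregister` callback are hypothetical stand-ins; the real repository and authorization service interfaces are not reproduced here.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional


class InMemoryTaskRepository:
    """Hypothetical stand-in for the task repository: maps task id -> name."""

    def __init__(self, rows: Dict[str, str]) -> None:
        self.rows = dict(rows)

    async def get_id_by_name(self, name: Optional[str]) -> Optional[str]:
        for task_id, task_name in self.rows.items():
            if task_name == name:
                return task_id
        return None

    async def delete(self, *, id: Optional[str] = None, name: Optional[str] = None) -> None:
        if id is None:
            id = await self.get_id_by_name(name)
        self.rows.pop(id, None)


async def delete_task(
    repo: InMemoryTaskRepository,
    deregister: Callable[[str], Awaitable[None]],
    dual_write_enabled: bool,
    *,
    id: Optional[str] = None,
    name: Optional[str] = None,
) -> None:
    task_id = id
    if dual_write_enabled and task_id is None and name is not None:
        # Resolve the id while the row still exists; after the delete,
        # a lookup by name would find nothing to deregister.
        task_id = await repo.get_id_by_name(name)
    await repo.delete(id=id, name=name)
    if dual_write_enabled and task_id is not None:
        await deregister(task_id)
```

With the flag off, the extra lookup and the deregister call are both skipped, so the delete path behaves as it did before.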

Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
agentex/database/migrations/alembic/versions/2026_05_21_1508_add_task_creator_columns_a1f73ada66c5.py:40-51
**NOT VALID + VALIDATE in the same Alembic transaction negates the stated safety guarantee**

Both `ADD CONSTRAINT ... NOT VALID` and `VALIDATE CONSTRAINT` execute inside the same default Alembic transaction. PostgreSQL holds transaction-level locks until the transaction commits, so the `ACCESS EXCLUSIVE` lock acquired by `ADD CONSTRAINT ... NOT VALID` is never released until the entire `upgrade()` function commits — meaning the full-table validation scan runs while `ACCESS EXCLUSIVE` is still held. On a large `tasks` table this blocks all concurrent reads and writes for the same wall-clock duration as a plain `ADD CONSTRAINT` without `NOT VALID`. The CONCURRENTLY indexes are correctly placed inside `autocommit_block` (each statement gets its own autocommit transaction), but the constraint pair is not.

To achieve the stated goal, both constraint statements need to run in separate transactions — either by wrapping them together inside their own `autocommit_block` (each `op.execute` becomes its own transaction in autocommit mode, so `NOT VALID` commits with a brief `ACCESS EXCLUSIVE` and `VALIDATE` then runs under `SHARE UPDATE EXCLUSIVE` with concurrent writes allowed), or by splitting into two separate revisions.

### Issue 2 of 3
agentex/database/migrations/alembic/versions/2026_05_21_1508_add_task_creator_columns_a1f73ada66c5.py:27-34
The indexes on nullable `creator_*` columns will be full indexes, meaning every NULL row is included. Since the vast majority of existing rows and ongoing non-HTTP-context rows will have both columns NULL, a partial index with `WHERE ... IS NOT NULL` would be significantly smaller and faster for the intended "tasks created by X" lookup pattern.

```suggestion
        op.execute(
            "CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_tasks_creator_user_id "
            "ON tasks (creator_user_id) WHERE creator_user_id IS NOT NULL"
        )
        op.execute(
            "CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_tasks_creator_service_account_id "
            "ON tasks (creator_service_account_id) WHERE creator_service_account_id IS NOT NULL"
        )
```

### Issue 3 of 3
agentex/src/domain/services/task_service.py:143-151
**Register failure propagates to the caller while the task row exists**

When `_dual_write_with_retry` exhausts all retries, `AuthenticationServiceUnavailableError` propagates out of `create_task`. The Postgres row is committed and the task exists, but the caller receives a 5xx. An API client following typical retry-on-error behavior will retry the entire `create_task` call and create a second task row for the same intent — both rows will have audit columns populated but at most one (the second) will be registered in the auth graph. This is documented as acceptable and covered by the AGX1-291 runbook, but it is worth flagging that the orphan-detection approach (scan by `creator_user_id`) will not distinguish duplicate rows from a single legitimately-orphaned row if the same principal retried.

Reviews (1): Last reviewed commit: "feat(AGX1-274): task FGAC dual-write cal..." | Re-trigger Greptile

Greptile also left 2 inline comments on this PR.

@asherfink asherfink force-pushed the asher.fink/agx1-274-task-dual-write branch from 13fe4b2 to 7486e5a Compare May 26, 2026 20:22
@asherfink asherfink changed the title from "feat(AGX1-274): dual-write tasks to spark-authz behind FGAC_TASKS_DUAL_WRITE flag" to "feat(AGX1-274): record task creator identity and FGAC migration safety" May 26, 2026
@asherfink asherfink force-pushed the asher.fink/agx1-274-task-dual-write branch from 7486e5a to b9cb26b Compare May 26, 2026 20:56
asherfink added 2 commits May 27, 2026 17:08
…n creation

Adds two nullable creator-audit columns to the tasks table — creator_user_id
and creator_service_account_id — populated from the principal context at
create time. A CHECK constraint (ck_tasks_at_most_one_creator) enforces that at most
one is set.

This replaces the earlier dual-write draft: grants are already issued
unconditionally via grant_with_retry in agents_acp_use_case.py:239, and
per-account rollout routing belongs in agentex-auth (private), not in this
public Apache-2.0 codebase.
@asherfink asherfink force-pushed the asher.fink/agx1-274-task-dual-write branch from ad1e980 to 3a06be8 Compare May 27, 2026 21:15
@asherfink asherfink marked this pull request as ready for review May 27, 2026 21:59
@asherfink asherfink requested a review from a team as a code owner May 27, 2026 21:59
Comment on lines +40 to +51
    op.execute(
        "ALTER TABLE tasks ADD CONSTRAINT ck_tasks_at_most_one_creator "
        "CHECK ((creator_user_id IS NULL) OR (creator_service_account_id IS NULL)) "
        "NOT VALID"
    )
    op.execute("ALTER TABLE tasks VALIDATE CONSTRAINT ck_tasks_at_most_one_creator")


def downgrade() -> None:
    op.drop_constraint("ck_tasks_at_most_one_creator", "tasks", type_="check")
    with op.get_context().autocommit_block():
        op.execute(
P1 NOT VALID + VALIDATE in the same Alembic transaction negates the stated safety guarantee

Both ADD CONSTRAINT ... NOT VALID and VALIDATE CONSTRAINT execute inside the same default Alembic transaction. PostgreSQL holds transaction-level locks until the transaction commits, so the ACCESS EXCLUSIVE lock acquired by ADD CONSTRAINT ... NOT VALID is never released until the entire upgrade() function commits — meaning the full-table validation scan runs while ACCESS EXCLUSIVE is still held. On a large tasks table this blocks all concurrent reads and writes for the same wall-clock duration as a plain ADD CONSTRAINT without NOT VALID. The CONCURRENTLY indexes are correctly placed inside autocommit_block (each statement gets its own autocommit transaction), but the constraint pair is not.

To achieve the stated goal, both constraint statements need to run in separate transactions — either by wrapping them together inside their own autocommit_block (each op.execute becomes its own transaction in autocommit mode, so NOT VALID commits with a brief ACCESS EXCLUSIVE and VALIDATE then runs under SHARE UPDATE EXCLUSIVE with concurrent writes allowed), or by splitting into two separate revisions.
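
One way the first option might look is sketched below. It reuses `op.get_context().autocommit_block()`, which the migration already uses for the concurrent indexes. In a real revision `op` comes from `from alembic import op`; it is passed as a parameter here (and the function name `add_creator_check` is hypothetical) only so the sketch can be exercised with a stub instead of a live database.

```python
# Statement text copied from the migration under review.
ADD_NOT_VALID = (
    "ALTER TABLE tasks ADD CONSTRAINT ck_tasks_at_most_one_creator "
    "CHECK ((creator_user_id IS NULL) OR (creator_service_account_id IS NULL)) "
    "NOT VALID"
)
VALIDATE = "ALTER TABLE tasks VALIDATE CONSTRAINT ck_tasks_at_most_one_creator"


def add_creator_check(op) -> None:
    """Run both constraint statements in autocommit mode.

    In autocommit mode each statement is its own transaction, so the lock
    taken by ADD CONSTRAINT ... NOT VALID is released before the validation
    scan starts.
    """
    with op.get_context().autocommit_block():
        op.execute(ADD_NOT_VALID)
        op.execute(VALIDATE)
```

If validation fails partway, the NOT VALID constraint has already been committed, so a rerun would need to tolerate the constraint existing. That trade-off is worth weighing against the two-revision alternative.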


Comment on lines +143 to +151
        if self.dual_write_enabled:
            await self._dual_write_with_retry(
                op_name="register",
                do_call=lambda: self.authorization_service.register_resource(
                    AgentexResource.task(task_entity.id),
                    parent_resource=AgentexResource.agent(agent.id),
                ),
                task_id=task_entity.id,
            )
P2 Register failure propagates to the caller while the task row exists

When _dual_write_with_retry exhausts all retries, AuthenticationServiceUnavailableError propagates out of create_task. The Postgres row is committed and the task exists, but the caller receives a 5xx. An API client following typical retry-on-error behavior will retry the entire create_task call and create a second task row for the same intent — both rows will have audit columns populated but at most one (the second) will be registered in the auth graph. This is documented as acceptable and covered by the AGX1-291 runbook, but it is worth flagging that the orphan-detection approach (scan by creator_user_id) will not distinguish duplicate rows from a single legitimately-orphaned row if the same principal retried.

